The distinctiveness of human faces has encouraged many researchers to exploit facial
features in domains such as user identification, behaviour analysis,
computer technology, security, and psychology. In this paper, we present
a method for facial attribute analysis. The work analyses facial
images and extracts features in order to recognize three demographic attributes:
age, gender, and ethnicity (AGE). We exploit the robustness
of deep learning (DL) through an updated version of the autoencoder called
the deep sparse autoencoder (DSAE). We use a new DSAE architecture
that adds supervision to the classic model, and we control
overfitting by regularizing the model. Moving from the DSAE
to the semi-supervised autoencoder (DSSAE) facilitates the supervision
process and achieves excellent feature extraction performance. In this
work we focus on estimating AGE jointly. The experimental results show
that the DSSAE recognizes facial features with high precision.
The whole system achieves good performance and high recognition rates for AGE
on the MORPH II database.


Supervised autoencoder 
1. INTRODUCTION

With the recent growth and development of technology, intelligent and recommender systems
have become a focus of research. We are in an era in which a system must know and understand its
current user. User identification is now an essential challenge for researchers; it is a necessity for
new technologies in many domains such as security, computer access control, e-commerce, banking, human
machine interaction (HMI), medicine, social media, applicant identification, civil protection, crime, terrorism,
and most recently the fight against social fraud [1].

In HMI, answering the question “who is the user?” is very complicated and requires excellent
precision, because the answer differs from one application area to another. Generally, in user analysis, including
user profiling and user modeling, we need to know the user’s age, gender, race, emotion, current behavior,
cultural level, sensory abilities, and experience. It is a large and multidisciplinary challenge, but a very
important one, because machines must be able to understand and analyze the user’s needs and to adapt themselves
to those needs and capabilities [2]. The challenge for these new technologies is to increase effectiveness and
robustness and to give a precise and correct response at the right moment, whatever the conditions. To know the user
and understand his/her sensory capabilities, physical abilities, affective state, and social and cultural level, many
researchers have used faces as an important stimulus in their work because of the particularity and distinctiveness
of facial features [3]. Facial features have the advantage of being unique and permanent, and they cannot be falsified,
unlike conventional means such as passwords or badges. The face is an essential indicator of identity and a basis for
identifying people (identity photography, anthropometry, facial recognition). In this work, we present an
algorithm for user identification that focuses on determining the AGE. This paper is organized as follows:
in the next section, related works are described. In section 3, we present the proposed method. In section 4,
performance and results are reported. Finally, conclusions are drawn in section 5.


2. RELATED WORKS 

In this section, we review existing work on AGE recognition. All the previous work cited
in this section is based on DL architectures. In age estimation research, posture, vocabulary and
intonation are significant cues for predicting the age of an interlocutor, but the face is still the most important
source of information for estimating real age; we can form an effective impression of an individual just by looking
at his/her face. In human computer interaction (HCI), age plays an important role in producing effective and robust
interfaces in recommender systems, adaptive interfaces, smart technologies and embodied recognition. Gender
recognition is also an important factor in user identification, and many researchers exploit different biometric
techniques for gender identification. Gender recognition, based on 2D or 3D images, is a biometric technology that
can provide useful information for determining an individual’s identity. Like age and gender, ethnicity
is an important attribute in user identification in many types of research, especially in security.

The notion of ethnicity has been used since the eighteenth century to differentiate groups of individuals with
different physical characteristics. In the literature, many researchers exploit facial features to estimate AGE.
For example, Jordi et al. [1] presented a novel method for gender identification using a deep neural network
(DNN); the architecture proposed in their work is based on local features extracted from small overlapping
regions. The Local DNN was tested on the LFW and Gallagher’s databases and gave strong results, especially
with four layers; the difference was substantial compared with a network with one layer. In 2016, Manepali
et al. presented a novel method of age estimation on real images with different poses and different emotions
using the LFW, Groups, and FERET datasets. In this method, a dictionary is produced during the training phase,
and matching is completed by rebuilding the test image using a sparse dictionary. Kaya et al. [2] presented
an algorithm for AGE recognition in children through speech; they used a dataset containing sequences of children
aged between three and seven years in different emotional states (comfort, discomfort and neutral).

The classification was performed using an extreme learning machine (ELM) with a single hidden layer
feedforward network (SLFN). Antipov et al. [3] presented an algorithm for age and gender
classification using a convolutional neural network (CNN); they used three popular benchmarks, LFW, FG-NET,
and MORPH, for the training process. In 2017, Lei Cai et al. [4] presented a new architecture for pedestrian
gender recognition; to address the problems of illumination, occlusion and poor image quality, the researchers used
effective HOG-assisted deep feature learning (HDFL). They exploit the deep-learned and weighted HOG
feature extraction branches simultaneously on the input images.


3. PROPOSED METHOD 
3.1. Overview of the proposed classification algorithm 

In the age estimation process, our goal is not to find the exact age but to find the age group. We therefore
define three age groups: youth (16-30), senior (31-50) and elderly (51 and over). For ethnicity,
we classify race into two classes: Caucasian and not Caucasian. We thus have three classes for age, two for
ethnicity and two for gender. The final number of classes is 12 (3 x 2 x 2), as described in Figure 1 and organized
as follows: not Caucasian female (NCF) from 16 to 30, not Caucasian female from 31 to 50, not Caucasian female over 50,
Caucasian female (CF) from 16 to 30, Caucasian female from 31 to 50, Caucasian female over 50, not
Caucasian male (NCM) from 16 to 30, not Caucasian male from 31 to 50, not Caucasian male over 50,
Caucasian male (CM) from 16 to 30, Caucasian male from 31 to 50, and Caucasian male over 50. In this
work, we start with data pre-processing. The first step is to extract the face from each image; for this
purpose, we used the AdaBoost framework of Viola and Jones [5], published on July 13, 2001.

The second step is to crop the faces. Finally, an in-plane rotation is applied to adjust the head
orientation, because it could influence the performance of the algorithm. The preprocessing is applied to the two
frameworks, AGER and ER. In machine learning, the main problem of classification is to determine
to which of a set of groups a new sample belongs, by extracting features from a training set of
observations whose class membership is already known. In this respect, DL is at the forefront of the revolution
that artificial intelligence, and especially machine learning, is experiencing today. Methods exploiting
DNNs in machine learning have proved their robustness on complex tasks in the fields of image
and acoustic processing.
Figure 1. Different classes used for age, gender and ethnicity recognition
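
The 12-class label space of Section 3.1 can be sketched as a small lookup. This is only an illustration: the names `CLASSES`, `age_group` and `joint_class`, the string labels and the class ordering are assumptions made here, not taken from the paper. Only the age ranges and the two-way ethnicity and gender splits come from the text.

```python
from itertools import product

# Age groups as defined in Section 3.1: youth, senior, elderly.
AGE_GROUPS = ("16-30", "31-50", "51+")
ETHNICITIES = ("NC", "C")   # not Caucasian, Caucasian
GENDERS = ("F", "M")        # female, male

# Assumed ordering, following the order in which the text lists the classes:
# NCF, CF, NCM, CM, each split into the three age groups.
CLASSES = [f"{e}{g} {a}" for g, e, a in product(GENDERS, ETHNICITIES, AGE_GROUPS)]


def age_group(age: int) -> str:
    """Map an age in years to one of the three groups used in the paper."""
    if age < 16:
        raise ValueError("ages below 16 are outside the groups defined in the text")
    if age <= 30:
        return "16-30"
    if age <= 50:
        return "31-50"
    return "51+"


def joint_class(age: int, gender: str, caucasian: bool) -> int:
    """Return the index (0-11) of the joint age/gender/ethnicity class."""
    label = f"{'C' if caucasian else 'NC'}{gender} {age_group(age)}"
    return CLASSES.index(label)
```

For instance, `joint_class(25, "F", False)` selects the first class (not Caucasian female, 16 to 30), and `joint_class(60, "M", True)` selects the last one (Caucasian male, over 50).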


3.2. Autoencoder model 

Autoencoders are a robust DL architecture. As described in [6], an autoencoder is
composed of two parts, an encoder and a decoder, which are built with deep architectures. The number
of neurons in the last decoder layer is equal to the size of the network input. The purpose of an autoencoder is
to find a coded representation of an input that can be decoded accurately. Such a network is driven to find
a representation of the input data and to learn the connection between an input and its hidden representation.
We consider the input vector X, transformed into a hidden representation as follows:

$$z = f(X) = \mathrm{sigm}(WX + b) \tag{1}$$

where W and b denote the weight matrix and the bias between the input and hidden layers, respectively.
The mapping uses the sigmoid function or the tanh function (the sigmoid function is usually used as
the activation function):

$$\mathrm{sigm}(y) = \frac{1}{1 + \exp(-y)}, \qquad \tanh(y) = \frac{\exp(y) - \exp(-y)}{\exp(y) + \exp(-y)} \tag{2}$$

In the decoding stage, the hidden representation is mapped back to the input space as follows:

$$\hat{X} = g(z) = \mathrm{sigm}(W'z + b') \tag{3}$$

where W’ and b’ denote the weight matrix and the bias between the hidden and output layers, respectively.
The network is trained by minimizing the reconstruction error, defined by the Euclidean cost:

$$\arg\min_{W, W'} \|X - \hat{X}\|_F^2 \tag{4}$$
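
As a minimal sketch of equations (1), (3) and (4), the following computes one encode/decode pass and the squared reconstruction error for a single input vector. The layer sizes, the random weights and the helper names (`affine_sigm`, `random_matrix`, `reconstruction_error`) are illustrative assumptions, not the configuration used in the paper, and no training step is shown.

```python
import math
import random


def sigm(y: float) -> float:
    """Logistic sigmoid, the activation used in equations (1) and (3)."""
    return 1.0 / (1.0 + math.exp(-y))


def affine_sigm(W, b, v):
    """Compute sigm(W v + b) for a matrix W (list of rows), bias b and vector v."""
    return [sigm(sum(w * x for w, x in zip(row, v)) + b_i)
            for row, b_i in zip(W, b)]


def reconstruction_error(x, x_hat) -> float:
    """Squared Euclidean distance between the input and its reconstruction."""
    return sum((a - c) ** 2 for a, c in zip(x, x_hat))


def random_matrix(rows, cols, rng):
    """Small random weights, used here only to make the sketch runnable."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_in, n_hidden = 8, 3                      # toy sizes, not the paper's
W, b = random_matrix(n_hidden, n_in, rng), [0.0] * n_hidden
W_dec, b_dec = random_matrix(n_in, n_hidden, rng), [0.0] * n_in

x = [rng.random() for _ in range(n_in)]
z = affine_sigm(W, b, x)                   # encoder, equation (1)
x_hat = affine_sigm(W_dec, b_dec, z)       # decoder, equation (3)
error = reconstruction_error(x, x_hat)     # quantity minimized in equation (4)
```

The hidden code `z` has fewer entries than `x`, which is what forces the network to compress the input before reconstructing it.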


The purpose of an autoencoder is to capture correlations in the input data for dimensionality
reduction, after which classification is performed. In the literature, many types of autoencoders have been
presented and discussed, for example progressive, sparse and denoising autoencoders. Sparse coding
gives a good representation and good performance. The term sparse indicates that we
require hidden neurons that share the same target probability of activation. The number of neurons in the hidden
layers is smaller than that of the input and output layers. This compresses the data and forces the network to find
correlations in the data, which can then be classified according to these correlations. The objective minimized by
the sparse autoencoder is as follows [6, 7]:

$$\arg\min_{W, W'} \|X - W'\phi(WX)\|_F^2 + \beta \sum_{j=1}^{m} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) \tag{5}$$

where $\phi$ denotes the activation function, m is the number of hidden nodes, and $\beta$ is a coefficient that
determines the weight of the sparsity penalty term. $\rho$ is the sparsity parameter; it is the target average
activation of the hidden units, which is generally a small value close to zero.
$\hat{\rho}_j = \frac{1}{N} \sum_{i=1}^{N} h_j(x_i)$ denotes the average activation of hidden node j over the N training
samples, and the Kullback-Leibler divergence is defined by:

$$\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log\frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log\frac{1 - \rho}{1 - \hat{\rho}_j}$$
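
The sparsity term of equation (5) can be sketched as follows. The function names and the toy activations are assumptions made for illustration; the code assumes sigmoid-like activations strictly between 0 and 1 so that the logarithms are defined.

```python
import math


def kl_bernoulli(rho: float, rho_hat: float) -> float:
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


def sparsity_penalty(activations, rho: float, beta: float) -> float:
    """beta * sum_j KL(rho || rho_hat_j), the second term of equation (5).

    `activations` holds one list of hidden activations per training sample.
    """
    n_samples = len(activations)
    n_hidden = len(activations[0])
    # Average activation of each hidden node over the training samples.
    rho_hat = [sum(sample[j] for sample in activations) / n_samples
               for j in range(n_hidden)]
    return beta * sum(kl_bernoulli(rho, r) for r in rho_hat)


# Toy activations for two samples and two hidden nodes (illustrative values).
toy_activations = [[0.1, 0.9], [0.3, 0.7]]
penalty = sparsity_penalty(toy_activations, rho=0.2, beta=5.0)
```

In this toy case the first node has an average activation equal to the target (0.2) and contributes nothing, while the second node (average 0.8) carries the whole penalty.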


The purpose of training a sparse autoencoder is to learn to extract features automatically
from unlabeled data. Recently, many researchers have argued that autoencoders can be made semi-supervised,
because they can perform better when classes are specified; their work demonstrates the performance of
this new architecture, and their results show that supervised autoencoders are more robust than many ordinary
networks [8-11]. In this work, we used the sparse autoencoder in a semi-supervised manner to predict age,
gender and ethnicity from facial images. The idea of the semi-supervised, or class-specific,
autoencoder is to incorporate class information into the basic autoencoder architecture in order
to improve the supervision process. We keep the same architecture as in the unsupervised case and add
labeled data to supervise the autoencoder, learning features such that the representations of samples belonging to
a class are equal to the mean representation of that class. Among the first researchers to work with
autoencoders in a supervised manner were Gao et al. in 2015 [10], who modified denoising autoencoders
to optimize identification performance.

Specifically, the idea of the semi-supervised autoencoder is to manually specify the input features x given to
the algorithm. Once a good feature representation is given, a learning algorithm can do well. The class-specific
autoencoder makes the features of samples from the same class similar; in fact, it extracts more effective features for
representing the same group. As described previously, in the first step, we propose to learn features such that
they have the same sparsity signature across every class. For a given input X, the loss function in the traditional
autoencoder architecture is given in (4). Majumdar et al. [12, 13] proposed l1-norm
regularization as follows:

$$\arg\min_{W, W'} \|X - W'\phi(WX)\|_F^2 + \lambda \|WX\|_1 \tag{6}$$

To incorporate supervision into the classic autoencoder architecture, we pass from unsupervised
to supervised learning by labelling the data, so that the training data can be written as follows:

$$X = [\, X_1 \,|\, X_2 \,|\, \dots \,|\, X_C \,]$$

where the training data is divided into C classes and $X_c$ holds the samples of class c. The idea is to learn features
with a common sparse support, so that $WX_c$ is row sparse. This is achieved by incorporating l2,1-norm
regularization as follows:

$$\arg\min_{W, W'} \|X - W'\phi(WX)\|_F^2 + \lambda \sum_{c=1}^{C} \|WX_c\|_{2,1} + \beta \sum_{c=1}^{C} \mathrm{KL}(\rho \,\|\, \hat{\rho}_c) \tag{7}$$

where $\|V\|_{2,1} = \sum_j \|V^{j \to}\|_2$ is the sum of the l2-norms of the rows of V. We note that the input X belongs
to class c during the classification phase; then, taking all classes into consideration, we optimize W and W’ to reach
a minimum of the objective function. The inner l2-norm promotes a dense (non-zero) solution within the selected rows,
while the outer l1-norm (sum) enforces sparsity in selecting the rows [6, 13]. The proposed solution shown in (7)
improves row-sparsity within every group. The architecture of the proposed method is described in Figure 2.
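
To illustrate why the l2,1-norm in (7) favours a common sparse support, the sketch below compares it with the entry-wise l1-norm on two toy feature matrices. The matrices, the function names and the value of the regularization weight are assumptions chosen for illustration only.

```python
import math


def l21_norm(M) -> float:
    """l2,1-norm: sum over rows of the l2-norm of each row (M is a list of rows)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in M)


def l1_norm(M) -> float:
    """Entry-wise l1-norm, shown only for comparison."""
    return sum(abs(v) for row in M for v in row)


def group_penalty(features_by_class, lam: float) -> float:
    """lam * sum_c ||W X_c||_{2,1}, given one feature matrix W X_c per class."""
    return lam * sum(l21_norm(F) for F in features_by_class)


# Toy feature matrices: rows are hidden units, columns are samples of one class.
shared_support = [[3.0, 4.0], [0.0, 0.0]]   # both samples activate the same unit
scattered = [[3.0, 0.0], [0.0, 4.0]]        # each sample activates a different unit
```

Both matrices have the same l1-norm (7), but the l2,1-norm is 5 for `shared_support` and 7 for `scattered`, so the penalty is smaller when the samples of a class activate the same hidden units.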


4. EXPERIMENTAL RESULTS 
4.1. MORPH II dataset 

For AGE we used the MORPH II and FG-NET databases. The MORPH II benchmark holds 55,000 images
of more than 13,000 individuals, collected from 2003 to late 2007. Ages range from 16 to 77, with
a median age of 33. The average number of images per individual is 4, and the average time between photos is
164 days (the minimum being one day and the maximum 1681 days). The standard deviation of the number of days
between images is 180. In the literature, MORPH II is the most extensive database that combines
age, gender and ethnicity. The distribution of the face images is shown in Table 1. Unlike previous studies on
MORPH II, which divide the whole MORPH II database W into three sets, S1, S2, and S3, we split the facial
images into 12 classes as shown in Figure 2. Some examples of facial images extracted from the MORPH II dataset,
covering different ages and ethnicities, are shown in Figure 3. In every image, we detect and crop the face area,
and the image is resized to 32*32 pixels. Only two ethnic groups are used in this work (Caucasian and not
Caucasian). We note that for the elderly classes we added more than 350 images to each class because of the small
number of elderly images in comparison with the other groups. The facial images used in the test
phase for the MORPH II database are described in Table 1.
Figure 3. Examples of facial images of MORPH II database.


Table 1. The illustration of facial images used in the test phase for the MORPH II database








4.2. Age, gender and ethnicity recognition (AGER)

This section evaluates the presented study on AGE and compares it
with other recognition methods. The results for age, gender and ethnicity are presented separately in Table 2,
Table 3 and Table 4. The accuracy of joint AGE estimation is 73.5%. The best rates are obtained for NCM
(30-50), CM (50+) and NCF (50+), with 78.7%, 83.1% and 84.4% respectively. The difficulty in comparing
the proposed method (joint AGER) with others is the lack of models that estimate AGE jointly. We therefore also
used the DSSAE to estimate the AGE attributes separately on the MORPH II database, and we compare it with other DL
models, as well as with some other approaches on MORPH II and other benchmarks. Consider first the age
estimation results shown in Table 2. To test and evaluate the proposed method, we used the mean
absolute error (MAE) to measure performance in age estimation. It is calculated as follows:

$$E = \frac{1}{N} \sum_{i=1}^{N} |y'_i - y_i| \tag{8}$$

where y’ and y denote the predicted and real age values respectively, and N denotes the number of
test facial images. The purpose of our work is not to extract the exact age but only to classify
ages into three ranges: youth (16-30), senior (31-50) and elderly (51 and over). The proposed method obtains
an MAE of 3.26 years, which compares well with the other methods in Table 2. We consider this error
low given that the age estimation is performed on a very large database, and this error reduction is
statistically significant. The approach performs best on the young and senior groups, where the MAE is
below 3.26.
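
Equation (8) can be sketched as follows. The function name and the sample ages are made up for illustration and are not results from the paper.

```python
def mean_absolute_error(predicted, actual) -> float:
    """MAE between predicted and real ages, as in equation (8)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Made-up ages for illustration only.
predicted_ages = [23, 35, 58, 41]
real_ages = [25, 31, 60, 41]
mae = mean_absolute_error(predicted_ages, real_ages)  # (2 + 4 + 2 + 0) / 4 = 2.0
```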


Table 2. Age estimation error (MAE, in years) on the MORPH II dataset

Method                             Year   MAE (MORPH)
Flexible overlapped AAM+LPQ [6]    2015   5.68
ODLF [12]                          2017   3.12
CSC+STD Pooling [14]               2017   3.66
CSC+Max Pooling [14]               2017   3.78
GA-DFL [15]                        2016   3.37
D2C [16]                           2017   3.06
Net VGG [17]                       2017   2.96
Mohammed et al. [18]               2019   3.17
Taheri et al. [19]                 2019   2.81
Proposed method                    -      3.26





Table 3. Comparison of gender accuracy with the state-of-the-art methods (%)

Method                 Year   Accuracy
Duan et al. [8]        2017   88.20
Guo et al. [10]        2014   98.40
Dhomne et al. [16]     2018   95.00
Srinivas et al. [17]   2017   84.70
Lee et al. [20]        2017   88.50
Huang et al. [21]      2017   89.60
Benini et al. [22]     2019   98.59
Fang et al. [23]       2019   98.80
Proposed method        -      95.00





Table 4. Accuracy of ethnicity recognition for MORPH II dataset (%)

Method                 Year   Accuracy
Guo et al. [10]        2014   99.00
Uddin et al. [13]      2016   95.40
Srinivas et al. [17]   2017   33.33
Mohammed et al. [18]   2019   93.3
Hocquet et al. [24]    2016   97.50
Mohammed et al. [25]   2017   94.60
Proposed method        -      98.20





In our study, we classified gender with and without ethnicity consideration. The accuracy of gender
recognition without ethnicity consideration with the DSSAE is 95%. We consider this an encouraging
result for two reasons. Firstly, the state of the art shows that this rate is competitive with
other methods, as listed in Table 3. Secondly, the MORPH II benchmark is challenging in its composition:
the database contains facial images of people with varied appearances, such as men with long hair
and women with short hair. Table 5 shows the confusion matrix of gender recognition with consideration of
ethnicity. Facial images were divided into four classes: NCF, CF, NCM and CM. The accuracy increased by
about 2% in comparison with gender recognition without ethnicity consideration. For ethnicity estimation,
we only report accuracies for the Caucasian and not Caucasian classes, since other race groups, such as Asian and
Indian, were not used in training because the number of images is very small. The classification therefore uses two
ethnicity classes. The accuracy obtained is 98.2%; this rate compares well with the other methods
listed in Table 4.

In this article, we used the DSSAE as a classifier, and we measured the accuracy obtained when
the classifier is tested with different numbers of hidden layers. The parameters of the deep neural network were
investigated by altering the number of hidden layers, the number of neurons and the size of the training set.
We carried out extensive experiments to determine the optimum parameters. The number of layers in the DNN is
crucial for a training set of about 15,000 images. We used two models with different numbers of hidden layers and
examined the final results: the first model, mod1, has two hidden layers, and the second,
mod2, has three. The results are summarized in Table 6.


Table 5. Confusion matrix (%) of gender recognition with ethnicity consideration (MORPH II)

                       NCF    CF     NCM    CM
Not Caucasian Female   92.4   2.0    4.2    2.0
Caucasian Female       2.1    94.6   0.0    3.2
Not Caucasian Male     3.9    0.0    91.5   0.0
Caucasian Male         1.6    3.4    4.3    94.8





Table 6. Parameters used for the DSSAE architecture

Parameters                          Layers    Size   Regularization term (λ)   Sparsity parameter (ρ)   Weight of sparsity penalty (β)   Accuracy
Age, gender and ethnicity jointly   Layer 1   50     0.002                     0.5                      5                                73.5%
                                    Layer 2   25     0.003                     0.8                      5





The presented work shows improved performance using the DSSAE. This performance is
explained by two important factors. Firstly, the use of class supervision improved the accuracy;
results in our work and in the literature show that the AE is very effective when used in a supervised manner.
The second factor is the use of the L1 and L2 norms to regularize the model, which helped to
reduce overfitting. In addition, the solution used in this work enhances
the sparsity within every class and consequently improves the generalization of features for every class.
The major problem encountered in this work is the number of samples available for every class, because dividing
the samples among the three attributes decreases the total number in every class.

This explains the low rate found for the elderly classes, where the very low number of samples makes
the model unable to extract enough features and to generalize. This is why we added samples to the elderly classes, as
explained in section 4.1, and the accuracy increased from 52.25% to 63.57% (NCM 50+). This can be
seen as a kind of data augmentation. In the young and senior classes, we found rates of more
than 80% for all attributes. In these classes, the large number of samples allows the model to generalize.
However, the results obtained for the facial attributes separately are more interesting than these joint rates.


5. CONCLUSION 

In this article, we presented a method for AGE recognition. We used autoencoders for classification. The work
uses faces from the MORPH II database to recognize AGE jointly by classifying facial images into
12 classes covering three demographic attributes (age, gender and ethnicity). The classification model is based
on an updated version of the autoencoder called the DSSAE. In this model, we try to exploit the advantages of
supervised and unsupervised learning at the same time. The experiments were conducted on an extensive
database containing more than 55,000 face images, and they show the robustness of our method as
a classification model for finding the three attributes separately.